Partly Supervised Uighur Morpheme Segmentation

Authors

  • Mijit Ablimit
  • Tatsuya Kawahara
Abstract

This paper introduces Uighur morpheme segmentation, a basic part of the comprehensive Uighur language corpus compilation effort conducted at Xinjiang University in cooperation with Kyoto University. Uighur is an agglutinative language in which word structures are formed by productive affixation of derivational and inflectional suffixes to stems. Derivational suffixes change the meaning of the stems, while inflectional suffixes define grammatical functions of the stems, such as case. The surface realization of words is also constrained by phonetic rules such as phonetic harmony and vowel weakening, but the surface form of the stem is basically unchanged except for the last vowel. For example, the words “adam+lar, adam+ni, adam+ga, adam+ning, adam+dak” are formed by attaching the different suffixes “lar, ni, ga, ning, dak” to the stem “adam” (meaning “person”). There are also complex, or compound, suffixes. These produce a huge number of combinations, so morpheme segmentation is a vital part of Uighur language analysis. We compiled lists of 38,500 stems and 325 singular suffixes to cover most general words. Then, a list of compound suffixes was collected in an unsupervised manner from our corpus of 200K words by matching against the basic list. After manual checking, 5,880 compound suffixes were obtained. For automatic morpheme segmentation, we apply a forward and backward matching algorithm based on the lists. One of the biggest problems is vowel weakening, that is, the last vowel of the stem, “a” or “ä”, is often replaced by another vowel, “i” or “e”. The phenomenon is observed for 12% of the words in our corpus. Thus, we have devised substitution rules, but these cause ambiguity in the morpheme segmentation. When more than one segmentation hypothesis is generated, the hypothesis with the longer stem is preferred; this is a safe heuristic. (The work is funded by the Natural Science Fund of China, No. 60662002.)
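
The list-based matching with vowel-weakening substitution and the longer-stem preference can be illustrated with a minimal sketch. It is not the authors' implementation: it simply enumerates split points rather than reproducing their forward and backward matching, and the tiny word lists, the vowel inventory, and the names `restore_candidates` and `segment` are assumptions made for illustration only.

```python
# Toy data (assumed for illustration; the paper's lists hold 38,500 stems,
# 325 singular suffixes and 5,880 compound suffixes).
STEMS = {"adam", "bala"}
SUFFIXES = {"lar", "ni", "ga", "ning", "dak", "si"}

VOWELS = set("aäeiouöü")   # simplified vowel inventory (assumption)
WEAKENED = {"i", "e"}      # vowels that may stand for a weakened "a" or "ä"
ORIGINALS = ("a", "ä")     # vowels restored by the substitution rule


def restore_candidates(surface_stem):
    """Yield the surface stem, then variants with its last vowel restored."""
    yield surface_stem
    # Scan backwards for the last vowel of the surface stem.
    for pos in range(len(surface_stem) - 1, -1, -1):
        if surface_stem[pos] in VOWELS:
            if surface_stem[pos] in WEAKENED:
                for vowel in ORIGINALS:
                    yield surface_stem[:pos] + vowel + surface_stem[pos + 1:]
            return  # only the last vowel is subject to weakening


def segment(word, stems=STEMS, suffixes=SUFFIXES):
    """Return (stem, suffix) with the longest matching stem, or None."""
    hypotheses = []
    for cut in range(1, len(word) + 1):
        surface_stem, suffix = word[:cut], word[cut:]
        # An empty suffix means the whole word is tried as a bare stem.
        if suffix and suffix not in suffixes:
            continue
        for stem in restore_candidates(surface_stem):
            if stem in stems:
                hypotheses.append((stem, suffix))
    if not hypotheses:
        return None
    # Heuristic from the abstract: prefer the hypothesis with the longer stem.
    return max(hypotheses, key=lambda hyp: len(hyp[0]))
```

With these toy lists, `segment("adamlar")` yields the stem "adam" and the suffix "lar", while a weakened form such as "balisi" is matched to the listed stem "bala" by restoring the last vowel. Words whose stems are missing from the list return `None`, which corresponds to the out-of-list words that the abstract mentions as an open problem.
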
Phonetic harmony is also a key factor that controls stem-suffix connection and syllable concatenation. Thus, we have also introduced phonetic harmony rules, which constrain the connection of stems and suffixes to ensure smooth articulation. For example, certain voiced consonants at the end of a stem must be followed by a suffix starting with a voiced consonant. This constraint effectively reduces the ambiguity. The method was evaluated on 18,400 words chosen from our corpus: the accuracy of stem-suffix boundary detection is 96%, and the accuracy of full stem/suffix segmentation is 92%. The result is encouraging, since the stems of some words, such as new words imported from English, are not included in the stem list. We are investigating an automated method based on a statistical model to cope with them.
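
The voiced-consonant example above can be sketched as a filter over candidate segmentations. The abstract does not list the actual harmony rules, so the consonant set below, the rule as coded, and the names `harmony_ok` and `filter_hypotheses` are simplified assumptions rather than the authors' rule set.

```python
# Simplified, assumed inventories; the real rule set is not given in the abstract.
VOICED = set("bdgjlmnrvz")
VOWELS = set("aäeiouöü")


def harmony_ok(stem, suffix):
    """Check one assumed rule: a stem-final voiced consonant requires
    a voiced consonant at the start of a consonant-initial suffix."""
    if not stem or not suffix:
        return True  # nothing to connect, so nothing to violate
    last, first = stem[-1], suffix[0]
    if last in VOICED and first not in VOWELS:
        return first in VOICED
    return True  # other contexts are left unconstrained in this sketch


def filter_hypotheses(hypotheses):
    """Keep only the (stem, suffix) hypotheses that pass the harmony check."""
    return [(stem, suffix) for stem, suffix in hypotheses
            if harmony_ok(stem, suffix)]
```

Under these assumptions, all five example words from the abstract pass the check, because "adam" ends in a voiced consonant and each of "lar, ni, ga, ning, dak" begins with one. A hypothesis pairing "adam" with a suffix that starts with an unvoiced consonant would be discarded before the longer-stem preference is applied.
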

Similar resources

A semi-supervised learning approach for morpheme segmentation for an Arabic dialect

We present a semi-supervised learning approach which utilizes a heuristic model for learning morpheme segmentation for Arabic dialects. We evaluate our approach by applying morpheme segmentation to the training data of a statistical machine translation (SMT) system. Experiments show that our approach is less sensitive to the availability of annotated stems than a previous rule-based approach an...

Morpheme Segmentation and Concatenation Approaches for Uyghur LVCSR

In this paper, various kinds of sub-word lexica are thoroughly investigated under the framework of Uyghur LVCSR system. Experimental results show that it is inefficient to directly model based on word units or small units like morpheme or even syllable units. It is observed that an optimal sub-word unit set between word and morpheme units can better fit for ASR system. In order to select best u...

Semi-supervised Learning for Mongolian Morphological Segmentation

Unlike previous Mongolian morphological segmentation methods based on large labeled training data or complicated rules concluded by linguists, we explore a novel semi-supervised method for a practical application, i.e., statistical machine translation (SMT), based on a low-resource learning setting, in which a small amount of labeled data and large amount of unlabeled data are available. First,...

Weakly Supervised Morphology Learning for Agglutinating Languages Using Small Training Sets

The paper describes a weakly supervised approach for decomposing words into all morphemes: stems, prefixes and suffixes, using wordforms with marked stems as training data. As we concentrate on under-resourced languages, the amount of training data is limited and we need some amount of supervision in the form of a small number of wordforms with marked stems. In the first stage we introduce a ne...

Rule-based Person Name Recognition for Xinjiang Minority Languages

Named entity recognition for the multiple nationalities of Xinjiang is an important part of multi-language processing. In this paper, we analyze the patterns of Uighur and Kazak person names and perform name recognition using a rule-based approach. We also propose and implement rules for Uighur and Kazak word segmentation.

Publication date: 2008